
    Controlled generation in example-based machine translation

    The theme of controlled translation is currently in vogue in the area of MT. Recent research (Schäler et al., 2003; Carl, 2003) hypothesises that EBMT systems are perhaps best suited to this challenging task. In this paper, we present an EBMT system where the generation of the target string is filtered by data written according to controlled language specifications. As far as we are aware, this is the only research available on this topic: in the field of controlled language applications, it is more usual to constrain the source language in this way rather than the target. We translate a small corpus of controlled English into French using the on-line MT system Logomedia, and seed the memories of our EBMT system with a set of automatically induced lexical resources, using the Marker Hypothesis as a segmentation tool. We test our system on a large set of sentences extracted from a Sun Translation Memory, and provide both an automatic and a human evaluation. For comparative purposes, we also provide results for Logomedia itself.
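
    As a concrete illustration of the segmentation step, the following is a minimal Python sketch of Marker Hypothesis chunking, in which closed-class "marker" words open new chunks; the marker inventory and the single constraint used here are an illustrative simplification, not the system's actual resources.

```python
# Minimal sketch of Marker Hypothesis segmentation: closed-class
# "marker" words (determiners, prepositions, pronouns, conjunctions)
# open a new chunk, and each chunk must contain at least one
# non-marker word. Illustrative marker subset only.
MARKERS = {
    "the", "a", "an",           # determiners
    "in", "on", "of", "with",   # prepositions
    "he", "she", "it", "they",  # pronouns
    "and", "or", "but",         # conjunctions
}

def marker_segment(sentence: str) -> list[list[str]]:
    """Split a tokenised sentence into marker-headed chunks."""
    chunks: list[list[str]] = []
    for tok in sentence.lower().split():
        starts_chunk = (
            tok in MARKERS
            and chunks
            and any(t not in MARKERS for t in chunks[-1])
        )
        if starts_chunk or not chunks:
            chunks.append([tok])   # marker word opens a new chunk
        else:
            chunks[-1].append(tok) # otherwise continue the current chunk
    return chunks

print(marker_segment("The file is saved in the memory"))
# [['the', 'file', 'is', 'saved'], ['in', 'the', 'memory']]
```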

    Robust large-scale EBMT with marker-based segmentation

    Previous work on marker-based EBMT [Gough & Way, 2003; Way & Gough, 2004] suffered from problems such as data sparseness and disparity between the training and test data. We have developed a large-scale, robust EBMT system. In a comparison with the systems listed in [Somers, 2003], ours is the third largest EBMT system and certainly the largest English-French EBMT system. Previous work used the on-line MT system Logomedia to translate source-language material as a means of populating the system’s database where bitexts were unavailable. We derive our sententially aligned strings from a Sun Translation Memory (TM) and limit the integration of Logomedia to the derivation of our word-level lexicon. We also use Logomedia to provide a baseline comparison for our system, and observe that we outperform Logomedia and previous marker-based EBMT systems in a number of tests.
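
    The matching and recombination step of such a system can be sketched roughly as follows; the memory below is a toy stand-in for the chunk pairs induced from the Sun TM, and matching is a naive greedy longest-match search with a word-level back-off, which simplifies the actual system considerably.

```python
# Toy EBMT recombination: cover the input with the longest chunks found
# in a seeded (source -> target) memory, backing off to a word-level
# lexicon. All entries are illustrative, not real Sun TM data.
CHUNK_MEMORY = {
    ("click", "the", "button"): ("cliquez", "sur", "le", "bouton"),
    ("to", "save", "the", "file"): ("pour", "enregistrer", "le", "fichier"),
}
WORD_LEXICON = {"click": "cliquez", "the": "le", "file": "fichier"}

def translate(tokens: list[str]) -> str:
    out, i = [], 0
    while i < len(tokens):
        # try the longest source chunk starting at position i
        for j in range(len(tokens), i, -1):
            if tuple(tokens[i:j]) in CHUNK_MEMORY:
                out.extend(CHUNK_MEMORY[tuple(tokens[i:j])])
                i = j
                break
        else:  # no chunk matched: fall back to the word-level lexicon
            out.append(WORD_LEXICON.get(tokens[i], tokens[i]))
            i += 1
    return " ".join(out)

print(translate("click the button to save the file".split()))
# cliquez sur le bouton pour enregistrer le fichier
```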

    Example-based controlled translation

    The first research on integrating controlled language data in an Example-Based Machine Translation (EBMT) system was published in [Gough & Way, 2003]. We improve on their sub-sentential alignment algorithm to populate the system’s databases with more than six times as many potentially useful fragments. Together with two simple novel improvements (correcting mistranslations in the lexicon, and allowing multiple translations per lexicon entry), translation quality improves considerably when target-language translations are constrained. We also develop the first EBMT system which attempts to filter the source-language data using controlled language specifications. We provide detailed automatic and human evaluations of a number of experiments carried out to test the quality of the system, and observe that our system outperforms Logomedia in a number of tests. Finally, despite conflicting results from different automatic evaluation metrics, we observe a preference for controlling the source data rather than the target translations.
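
    The flavour of the sub-sentential alignment step, and in particular of allowing multiple translations per lexicon entry, can be conveyed by a deliberately naive co-occurrence sketch; the real algorithm is considerably more refined, and all data here is illustrative.

```python
# Naive sketch of sub-sentential alignment: for each source word,
# collect the target words it co-occurs with across aligned chunk
# pairs, and keep several candidates per entry (multiple translations
# per word, as in the improved lexicon). Illustrative data only.
from collections import Counter, defaultdict

def build_lexicon(aligned_chunks, top_n=2):
    cooc = defaultdict(Counter)
    for src_chunk, tgt_chunk in aligned_chunks:
        for s in src_chunk.split():
            for t in tgt_chunk.split():
                cooc[s][t] += 1
    # keep the top-N target candidates per source word
    return {s: [t for t, _ in c.most_common(top_n)] for s, c in cooc.items()}

pairs = [
    ("the file", "le fichier"),
    ("the printer", "l'imprimante"),
    ("save the file", "enregistrer le fichier"),
]
print(build_lexicon(pairs)["file"])  # ['le', 'fichier'] -- noisy ties like
# this are what frequency evidence over a larger bitext must resolve
```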

    Example-based machine translation using the marker hypothesis

    The development of large-scale rules and grammars for a Rule-Based Machine Translation (RBMT) system is labour-intensive, error-prone and expensive. Current research in Machine Translation (MT) tends to focus on the development of corpus-based systems which can overcome the problem of knowledge acquisition. Corpus-Based Machine Translation (CBMT) can take the form of Statistical Machine Translation (SMT) or Example-Based Machine Translation (EBMT). Despite the benefits of EBMT, SMT is currently the dominant paradigm, and many systems classified as example-based integrate additional rule-based and statistical techniques. Nevertheless, the benefits of an EBMT system which does not require extensive linguistic resources and can produce reasonably intelligible and accurate translations cannot be overlooked; we show that our linguistics-lite EBMT system can outperform an SMT system trained on the same data.

    The work reported in this thesis describes the development of a linguistics-lite EBMT system which does not have recourse to extensive linguistic resources. We apply the Marker Hypothesis (Green, 1979), a psycholinguistic theory which states that all natural languages are ‘marked’ for complex syntactic structure at surface form by a closed set of specific lexemes and morphemes. We use this technique in different environments to segment aligned (English, French) phrases and sentences, and then apply an alignment algorithm which can deduce smaller aligned chunks and words. Following a process similar to that of Block (2000), we generalise these alignments by replacing certain function words with an associated tag; in so doing, we cluster on marker words and add flexibility to our matching process. In a post hoc stage, we treat the World Wide Web as a large corpus and validate and correct instances of determiner-noun and noun-verb boundary friction.

    We have applied our marker-based EBMT system to different bitexts and have explored its applicability in various environments. We have developed a phrase-based EBMT system (Gough et al., 2002; Way and Gough, 2003), and show that despite the perceived low quality of on-line MT systems, our EBMT system can produce good-quality translations when such systems are used to seed its memories. Carl (2003a) and Schäler et al. (2003) suggest that EBMT is more suited to controlled translation than RBMT, as it has been known to overcome the ‘knowledge acquisition bottleneck’. To this end, we developed the first controlled EBMT system (Gough and Way, 2003; Way and Gough, 2004). Given the lack of controlled bitexts, we used an on-line MT system, Logomedia, to translate a set of controlled English sentences. We performed experiments using controlled analysis and generation, and assessed the performance of our system at each stage. We made a number of improvements to our sub-sentential alignment algorithm and, following some minimal adjustments to our system, we show that our controlled EBMT system can outperform an RBMT system.

    Finally, we applied the Marker Hypothesis to a more scalable data set, training our system on 203,529 sentences extracted from a Sun Microsystems Translation Memory. We thus reduced problems of data sparseness and limited our dependence on Logomedia. We show that scaling up the data in a marker-based EBMT system improves the quality of our translations, and we also report on the benefits of extracting lexical equivalences from the corpus using Mutual Information.
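
    The Mutual Information criterion mentioned above can be sketched as follows; probabilities are estimated as simple relative frequencies over aligned sentence pairs, which is one standard instantiation rather than necessarily the exact formulation used in the thesis.

```python
# Sketch of scoring candidate lexical equivalences with (pointwise)
# Mutual Information: MI(s, t) = log2( p(s, t) / (p(s) * p(t)) ),
# with probabilities estimated from co-occurrence counts over aligned
# sentence pairs. Data is illustrative only.
import math
from collections import Counter

def mutual_information(pairs):
    n = len(pairs)
    src, tgt, joint = Counter(), Counter(), Counter()
    for s_sent, t_sent in pairs:
        s_words, t_words = set(s_sent.split()), set(t_sent.split())
        src.update(s_words)
        tgt.update(t_words)
        joint.update((s, t) for s in s_words for t in t_words)
    return {
        (s, t): math.log2((c / n) / ((src[s] / n) * (tgt[t] / n)))
        for (s, t), c in joint.items()
    }

pairs = [("the file", "le fichier"), ("the disk", "le disque"),
         ("a file", "un fichier")]
mi = mutual_information(pairs)
# true equivalences score higher than chance co-occurrences:
print(mi[("file", "fichier")] > mi[("file", "le")])  # True
```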

    Teaching and assessing empirical approaches to machine translation

    Empirical methods in Natural Language Processing (NLP) and Machine Translation (MT) have become mainstream in the research field. Accordingly, it is important that the tools and techniques in these paradigms be taught to potential future researchers and developers in university courses. While many dedicated courses on Statistical NLP can be found, there are few, if any, courses on Empirical Approaches to MT. This paper presents the development and assessment of one such course as taught to final-year undergraduates taking a degree in NLP.

    wEBMT: developing and validating an example-based machine translation system using the world wide web

    We have developed an example-based machine translation (EBMT) system that uses the World Wide Web for two different purposes. First, we populate the system’s memory with translations gathered from rule-based MT systems located on the Web. The source strings input to these systems were extracted automatically from an extremely small subset of the rule types in the Penn-II Treebank. In subsequent stages, the (source, target) translation pairs obtained are automatically transformed into a series of resources that render the translation process more successful. Despite the fact that the output from on-line MT systems is often faulty, we demonstrate in a number of experiments that, when used to seed the memories of an EBMT system, they can in fact prove useful in generating translations of high quality in a robust fashion. In addition, we demonstrate the relative gain of EBMT in comparison to on-line systems. Second, despite the perception that the documents available on the Web are of questionable quality, we demonstrate in contrast that such resources are extremely useful in automatically post-editing translation candidates proposed by our system.
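
    One way such derived resources can be pictured is as generalised templates in which function words are replaced by a tag, so that a single stored translation pair matches many new inputs; the tag set, determiner lexicon, and pairs below are illustrative assumptions, not the system's actual resources.

```python
# Sketch of generalising an aligned pair by replacing determiners with
# a <DET> tag, then reusing the template for a new input whose
# determiner differs. All entries are illustrative (the toy determiner
# lexicon ignores gender and number agreement).
EN_DETS = {"the", "a", "an"}
DET_LEXICON = {"the": "le", "a": "un", "an": "un"}

def generalise(chunk: str) -> tuple[str, ...]:
    return tuple("<DET>" if t in EN_DETS else t for t in chunk.split())

# one aligned pair yields a reusable template:
TEMPLATES = {generalise("the file"): ("<DET>", "fichier")}

def apply_template(chunk: str) -> str:
    dets = [DET_LEXICON[t] for t in chunk.split() if t in EN_DETS]
    target = TEMPLATES[generalise(chunk)]
    return " ".join(dets.pop(0) if t == "<DET>" else t for t in target)

# the template learned from "the file" now covers "a file" as well:
print(apply_template("a file"))  # un fichier
```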

    Example-based machine translation via the web

    One of the limitations of translation memory systems is that the smallest translation units currently accessible are aligned sentential pairs. We propose an example-based machine translation system which uses a ‘phrasal lexicon’ in addition to the aligned sentences in its database. These phrases are extracted from the Penn Treebank, using the Marker Hypothesis as a constraint on segmentation. They are then translated by three on-line machine translation (MT) systems, and a number of linguistic resources are automatically constructed which are used in the translation of new input. We perform two experiments on test sets of sentences and noun phrases to demonstrate the effectiveness of our system; in so doing, we obtain insights into the strengths and weaknesses of the selected on-line MT systems. Finally, like many example-based machine translation systems, our approach suffers from the problem of ‘boundary friction’. Where the quality of resulting translations is compromised as a result, we use a novel, post hoc validation procedure via the World Wide Web to correct imperfect translations prior to their being output to the user.
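
    The post hoc validation procedure can be pictured as choosing, among candidate strings that differ at a chunk boundary, the variant attested most often in a very large corpus; web_hit_count below is a hypothetical stand-in for a real search interface, faked here with a local frequency table.

```python
# Sketch of post hoc validation of boundary friction: among candidate
# determiner-noun strings, keep the variant with the most corpus
# evidence. web_hit_count is a HYPOTHETICAL stand-in for a web search
# query, faked here with a tiny local table of illustrative counts.
FAKE_COUNTS = {"le fichier": 120_000, "la fichier": 40}

def web_hit_count(phrase: str) -> int:
    return FAKE_COUNTS.get(phrase, 0)  # stand-in for a real web query

def validate(candidates: list[str]) -> str:
    """Return the candidate with the most (approximate) web evidence."""
    return max(candidates, key=web_hit_count)

# gender agreement at a chunk boundary, resolved by corpus evidence:
print(validate(["la fichier", "le fichier"]))  # le fichier
```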

    Status Update and Interim Results from the Asymptomatic Carotid Surgery Trial-2 (ACST-2)

    Objectives: ACST-2 is currently the largest trial ever conducted to compare carotid artery stenting (CAS) with carotid endarterectomy (CEA) in patients with severe asymptomatic carotid stenosis requiring revascularization. Methods: Patients are entered into ACST-2 when revascularization is felt to be clearly indicated, when CEA and CAS are both possible, but where there is substantial uncertainty as to which is most appropriate. Trial surgeons and interventionalists are expected to use their usual techniques and CE-approved devices. We report baseline characteristics and blinded combined interim results for 30-day mortality and major morbidity for 986 patients in the ongoing trial up to September 2012. Results: A total of 986 patients (687 men, 299 women), mean age 68.7 years (SD 8.1), were randomized equally to CEA or CAS. Most (96%) had ipsilateral stenosis of 70-99% (median 80%), with contralateral stenoses of 50-99% in 30% and contralateral occlusion in 8%. Patients were on appropriate medical treatment. For the 691 patients who underwent intervention with at least 1 month of follow-up and Rankin scoring at 6 months for any stroke, the overall rate of serious periprocedural (within 30 days) cardiovascular events (disabling stroke, fatal myocardial infarction, or death) was 1.0%. Conclusions: Early ACST-2 results suggest that contemporary carotid intervention for asymptomatic stenosis has a low risk of serious morbidity and mortality, on par with other recent trials. The trial continues to recruit and to monitor periprocedural events and all types of stroke, aiming to randomize up to 5,000 patients to determine any differential outcomes between interventions. Clinical trial: ISRCTN21144362.

    Second asymptomatic carotid surgery trial (ACST-2): a randomised comparison of carotid artery stenting versus carotid endarterectomy

    Background: Among asymptomatic patients with severe carotid artery stenosis but no recent stroke or transient cerebral ischaemia, either carotid artery stenting (CAS) or carotid endarterectomy (CEA) can restore patency and reduce long-term stroke risks. However, from recent national registry data, each option carries about a 1% procedural risk of disabling stroke or death. Comparison of their long-term protective effects requires large-scale randomised evidence. Methods: ACST-2 is an international multicentre randomised trial of CAS versus CEA among asymptomatic patients with severe stenosis thought to require intervention, interpreted with all other relevant trials. Patients were eligible if they had severe unilateral or bilateral carotid artery stenosis and both doctor and patient agreed that a carotid procedure should be undertaken, but they were substantially uncertain which one to choose. Patients were randomly allocated to CAS or CEA and followed up at 1 month and then annually, for a mean of 5 years. Procedural events were those within 30 days of the intervention. Intention-to-treat analyses are provided. Analyses including procedural hazards use tabular methods. Analyses and meta-analyses of non-procedural strokes use Kaplan-Meier and log-rank methods. The trial is registered with the ISRCTN registry, ISRCTN21144362. Findings: Between Jan 15, 2008, and Dec 31, 2020, 3625 patients in 130 centres were randomly allocated, 1811 to CAS and 1814 to CEA, with good compliance, good medical therapy, and a mean of 5 years of follow-up. Overall, 1% had disabling stroke or death procedurally (15 allocated to CAS and 18 to CEA) and 2% had non-disabling procedural stroke (48 allocated to CAS and 29 to CEA). Kaplan-Meier estimates of 5-year non-procedural stroke were 2.5% in each group for fatal or disabling stroke, and 5.3% with CAS versus 4.5% with CEA for any stroke (rate ratio [RR] 1.16, 95% CI 0.86–1.57; p=0.33). Combining RRs for any non-procedural stroke in all CAS versus CEA trials, the RR was similar in symptomatic and asymptomatic patients (overall RR 1.11, 95% CI 0.91–1.32; p=0.21). Interpretation: Serious complications are similarly uncommon after competent CAS and CEA, and the long-term effects of these two carotid artery procedures on fatal or disabling stroke are comparable. Funding: UK Medical Research Council and Health Technology Assessment Programme.
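
    For readers unfamiliar with the statistics reported above, the following is a generic sketch of how a rate ratio and its 95% CI can be computed from event counts, using the standard log-scale normal approximation; the counts are illustrative round numbers, not the trial's actual data.

```python
# Generic sketch: rate ratio (RR) with a 95% CI from event counts and
# person-years at risk, using the usual log-scale approximation:
#   SE(log RR) = sqrt(1/e1 + 1/e2),  CI = RR * exp(+/- 1.96 * SE)
# The counts below are ILLUSTRATIVE, not ACST-2 data.
import math

def rate_ratio(e1: int, py1: float, e2: int, py2: float):
    rr = (e1 / py1) / (e2 / py2)
    se = math.sqrt(1 / e1 + 1 / e2)
    return rr, rr * math.exp(-1.96 * se), rr * math.exp(1.96 * se)

rr, lo, hi = rate_ratio(e1=97, py1=9000, e2=84, py2=9000)
print(f"RR {rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
# RR 1.15 (95% CI 0.86-1.55)
```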